Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

inbound: Improve policy metrics #1237

Merged
merged 48 commits into from
Sep 3, 2021
Merged

inbound: Improve policy metrics #1237

merged 48 commits into from
Sep 3, 2021

Conversation

olix0r
Copy link
Member

@olix0r olix0r commented Sep 2, 2021

We recently introduced metrics to help surface inbound policy decisions,
but in practice these haven't been as useful as we might hope.
Specifically, error metrics do not include the target_addr label so
these metrics can't be correlated with servers, etc. This change
improves error metrics and also introduces new metrics to describe
authorization decisions: authorization denials shouldn't be classified
as errors, really, anyway.

This change also improves TCP forwarding authorization so that policy
changes can be honored at runtime: previously authorized connections may
dropped if the policy is updated so that the connection is no longer
authorized.

The gateway is also updated to enforce HTTP policies at runtime as well
so that policy changes can be honored after the connection has been
established.

This change introduces new metrics:

  • inbound_http_authz_allow_total
  • inbound_http_authz_deny_total
  • inbound_tcp_authz_allow_total
  • inbound_tcp_authz_deny_total
  • inbound_tcp_authz_terminate_total

allow metrics include target_addr, srv_name, and saz_name
labels. deny and terminate metics include only target_addr and
srv_name labels.

Authorization denials are no longer reflected in inbound_tcp_error or
inbound_http_error metrics.

A number of internal changes have been made to support this:

  • The inbound::policy::authorize module includes middlewares for TCP
    and HTTP authorization, replacing the prior method of enforcing policy
    in the stack/router. This module ensures that metrics are recorded for
    policy decisions.
  • The error-metrics crate has been removed. In its place a monitor
    type has been added to the stack crate, supporting a general way to
    observe errors, decoupled from the metrics registry.
  • Inbound and outbound error metrics are now tracked in the inbound and
    outbound crates, respectively. Inbound- and outbound-specific error
    types are also moved into their rspective crates.
  • The app_core::errors module has been updated to only define the
    types it needs to instrument the error response layer. Error responses
    are now primarily instrumented via the HttpError type so that errors
    that should be handled can be configured where the error is thrown.
    The error type now holds an underlying source error so that the error
    metrics layer can see through this wrapper type to track the
    underlying error cause.
  • Server & Authorization labels are no longer handled as a free-form
    maps. We currently read only the name label from each; and this
    label is required.

We recently introduced metrics to help surface inbound policy decisions,
but in practice these haven't been as useful as we might hope.
Specifically, error metrics do not include the `target_addr` label so
these metrics can't be correlated with servers, etc. This change
improves error metrics and also introduces new metrics to describe
authorization decisions: authorization denials shouldn't be classified
as errors, really, anyway.

This change also improves TCP forwarding authorization so that policy
changes can be honored at runtime: previously authorized connections may
dropped if the policy is updated so that the connection is no longer
authorized.

The gateway is also updated to enforce HTTP policies at runtime as well
so that policy changes can be honored after the connection has been
established.

This change introduces new metrics:

* `inbound_http_authz_allow_total`
* `inbound_http_authz_deny_total`
* `inbound_tcp_authz_allow_total`
* `inbound_tcp_authz_deny_total`
* `inbound_tcp_authz_terminate_total`

_allow_ metrics include `target_addr`, `srv_name`, and `saz_name`
labels. _deny_ and _terminate_ metics include only `target_addr` and
`srv_name` labels.

Authorization denials are no longer reflected in inbound_tcp_error or
inbound_http_error metrics.

A number of internal changes have been made to support this:

* The `inbound::policy::authorize` module includes middlewares for TCP
  and HTTP authorization, replacing the prior method of enforcing policy
  in the stack/router. This module ensures that metrics are recorded for
  policy decisions.
* The `error-metrics` crate has been removed. In its place a `monitor`
  type has been added to the `stack` crate, supporting a general way to
  observe errors, decoupled from the metrics registry.
* Inbound and outbound error metrics are now tracked in the inbound and
  outbound crates, respectively. Inbound- and outbound-specific error
  types are also moved into their rspective crates.
* The `app_core::errors` module has been updated to only define the
  types it needs to instrument the error response layer. Error responses
  are now primarily instrumented via the `HttpError` type so that errors
  that should be handled can be configured where the error is thrown.
  The error type now holds an underlying source error so that the error
  metrics layer can see through this wrapper type to track the
  underlying error cause.
* Server & Authorization labels are no longer handled as a free-form
  maps. We currently read only the `name` label from each; and this
  label is required.
@olix0r olix0r requested a review from a team September 2, 2021 20:57
Copy link
Contributor

@hawkw hawkw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall, this looks good to me, i had mostly minor comments. i do think we may want to just replace the RwLocks with Mutexes, because it seems like we're not really getting any benefit from them --- we never have concurrent reads, and it might just be adding overhead.

Comment on lines +160 to +161
// should not be calling poll_data if we have set trailers derived from an error
assert!(trailers.is_none());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit/TIOLI: should we maybe just move the comment into the panic message for the assert?

linkerd/app/inbound/src/detect.rs Outdated Show resolved Hide resolved
@@ -355,6 +345,7 @@ impl tap::Inspect for Logical {
}

fn dst_labels<B>(&self, _: &http::Request<B>) -> Option<&tap::Labels> {
// TODO include policy labels here.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this need to be addressed as part of this PR, or is it a follow-up change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's a followup change already up

linkerd/app/inbound/src/lib.rs Outdated Show resolved Hide resolved
linkerd/app/inbound/src/metrics/authz.rs Outdated Show resolved Hide resolved
linkerd/app/inbound/src/metrics/error/tcp.rs Outdated Show resolved Hide resolved
linkerd/app/inbound/src/policy/mod.rs Outdated Show resolved Hide resolved
linkerd/stack/src/monitor.rs Show resolved Hide resolved
@olix0r olix0r merged commit 29c22af into main Sep 3, 2021
@olix0r olix0r deleted the ver/monitor branch September 3, 2021 02:26
olix0r added a commit to linkerd/linkerd2 that referenced this pull request Sep 3, 2021
This release features some performance improvements: tokio has been
updated to pick up tokio-rs/tokio#4055, and link-time optimizations have
been enabled in release builds. These changes reduce CPU and memory
overhead in benchmarks.

Inbound policy enforcement has been updated so that TCP forwarding is
interrupted if a policy update revokes a previously-established
authorization. New metrics are exposed to reflect how policies are used
by the proxy: `inbound_http_authz_{allow,deny}_total` and
`inbound_tcp_authz_{allow,deny,terminate}_total`.

The proxy's error metrics, `{inbound,outbound}_{http,tcp}_errors_total`,
have been updated to include the traffic target. And the `traffic_addr`
metric label is augmented by `target_ip` and `target_port` labels to
support more flexible prometheus queries.

Inbound TCP metrics now only include a `srv_name` label, as it can't be
expected for all inbound connections to include authorization labels
(hence the new authz metrics). However, all inbound HTTP metrics--except
for the HTTP errors metric, which includes only a `srv_name`
label--include both `srv_name` and `saz_name` label.

Finally, the inbound and outbound proxies now only exports
Route-oriented metrics when a ServiceProfile is enabled, preventing
redundant metrics from being exported with no differentiating labels.

---

* profiles: Avoid creating a default route stack (linkerd/linkerd2-proxy#1223)
* build(deps): bump arbitrary from 1.0.1 to 1.0.2 (linkerd/linkerd2-proxy#1224)
* build(deps): bump trust-dns-resolver from `f08860c` to `3d0667a` (linkerd/linkerd2-proxy#1225)
* build(deps): bump libc from 0.2.100 to 0.2.101 (linkerd/linkerd2-proxy#1226)
* Enable link-time optimizations (linkerd/linkerd2-proxy#1227)
* build(deps): bump serde_json from 1.0.66 to 1.0.67 (linkerd/linkerd2-proxy#1228)
* build(deps): bump flate2 from 1.0.20 to 1.0.21 (linkerd/linkerd2-proxy#1230)
* build(deps): bump thiserror from 1.0.26 to 1.0.28 (linkerd/linkerd2-proxy#1231)
* build(deps): bump futures from 0.3.16 to 0.3.17 (linkerd/linkerd2-proxy#1232)
* build(deps): bump parking_lot from 0.11.1 to 0.11.2 (linkerd/linkerd2-proxy#1234)
* build(deps): bump trust-dns-resolver from `3d0667a` to `v0.21.0-alpha.2` (linkerd/linkerd2-proxy#1233)
* Rename push_on_response to push_on_service (linkerd/linkerd2-proxy#1235)
* build(deps): bump tokio from 1.10.1 to 1.11.0 (linkerd/linkerd2-proxy#1236)
* metrics: Add `target_ip` and `target_port` labels (linkerd/linkerd2-proxy#1238)
* inbound: Improve policy metrics (linkerd/linkerd2-proxy#1237)
* inbound: Include server labels in tap responses (linkerd/linkerd2-proxy#1239)
* Revert rustc update for release builds
olix0r added a commit to linkerd/linkerd2 that referenced this pull request Sep 3, 2021
This release features some performance improvements: tokio has been
updated to pick up tokio-rs/tokio#4055, and link-time optimizations have
been enabled in release builds. These changes reduce CPU and memory
overhead in benchmarks.

Inbound policy enforcement has been updated so that TCP forwarding is
interrupted if a policy update revokes a previously-established
authorization. New metrics are exposed to reflect how policies are used
by the proxy: `inbound_http_authz_{allow,deny}_total` and
`inbound_tcp_authz_{allow,deny,terminate}_total`.

The proxy's error metrics, `{inbound,outbound}_{http,tcp}_errors_total`,
have been updated to include the traffic target. And the `traffic_addr`
metric label is augmented by `target_ip` and `target_port` labels to
support more flexible prometheus queries.

Inbound TCP metrics now only include a `srv_name` label, as it can't be
expected for all inbound connections to include authorization labels
(hence the new authz metrics). However, all inbound HTTP metrics--except
for the HTTP errors metric, which includes only a `srv_name`
label--include both `srv_name` and `saz_name` label.

Finally, the inbound and outbound proxies now only exports
Route-oriented metrics when a ServiceProfile is enabled, preventing
redundant metrics from being exported with no differentiating labels.

---

* profiles: Avoid creating a default route stack (linkerd/linkerd2-proxy#1223)
* build(deps): bump arbitrary from 1.0.1 to 1.0.2 (linkerd/linkerd2-proxy#1224)
* build(deps): bump trust-dns-resolver from `f08860c` to `3d0667a` (linkerd/linkerd2-proxy#1225)
* build(deps): bump libc from 0.2.100 to 0.2.101 (linkerd/linkerd2-proxy#1226)
* Enable link-time optimizations (linkerd/linkerd2-proxy#1227)
* build(deps): bump serde_json from 1.0.66 to 1.0.67 (linkerd/linkerd2-proxy#1228)
* build(deps): bump flate2 from 1.0.20 to 1.0.21 (linkerd/linkerd2-proxy#1230)
* build(deps): bump thiserror from 1.0.26 to 1.0.28 (linkerd/linkerd2-proxy#1231)
* build(deps): bump futures from 0.3.16 to 0.3.17 (linkerd/linkerd2-proxy#1232)
* build(deps): bump parking_lot from 0.11.1 to 0.11.2 (linkerd/linkerd2-proxy#1234)
* build(deps): bump trust-dns-resolver from `3d0667a` to `v0.21.0-alpha.2` (linkerd/linkerd2-proxy#1233)
* Rename push_on_response to push_on_service (linkerd/linkerd2-proxy#1235)
* build(deps): bump tokio from 1.10.1 to 1.11.0 (linkerd/linkerd2-proxy#1236)
* metrics: Add `target_ip` and `target_port` labels (linkerd/linkerd2-proxy#1238)
* inbound: Improve policy metrics (linkerd/linkerd2-proxy#1237)
* inbound: Include server labels in tap responses (linkerd/linkerd2-proxy#1239)
* Revert rustc update for release builds
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants